Abstract

Transitive text mining, also named Swanson Linking (SL) after its primary
and principal researcher, tries to establish meaningful links between
literature sets which are virtually disjoint in the sense that each does not
mention the main concept of the other. If successful, SL may give rise to the
development of new hypotheses.

In this communication we describe our approach to transitive text mining,
which employs co-occurrence analysis of the medical subject headings (MeSH),
the descriptors assigned to papers indexed in PubMed. In addition, we outline
the current state of our web-based information system, which will enable our
users to perform literature-driven hypothesis building on their own.

1 Introduction

Transitive text mining tries to link the major themes of disjoint literature
sets. Don Swanson was the first to describe the disclosure of "hidden" links,
i.e. unpublished but implicit links between concepts whose respective
literatures do not mention each other (Swanson (1986, 1988, 1991)). Later, the
principle of his method was termed Swanson Linking (SL), which may be defined
as finding disjoint literature partners by establishing meaningful links between
them using information retrieval from bibliographic databases (Stegmann and
Grohmann (2003)).

The published examples of SL involve basically three different sets of lit- 
erature: (i) a problem-based literature - e.g. describing a disease - is referred 
to as "source"; (ii) a literature not being mentioned in the source literature 
but possibly contributing to problem solving is called "target"; (iii) a litera- 
ture representing a major concept which is relevant for and occurs in both, 
source and target literature, is labeled "intermediate" (Swanson and Smalheiser 
(1999)). The discovery process might normally proceed from source to target 
via intermediate; however, the reverse order is naturally conceivable, and any 
coherent literature set regarded as "intermediate" may be explored for source
and target concepts simultaneously. 

Different approaches have been developed to detect in the investigated lit- 
erature sets key terms representing possible intermediate and target or source 
concepts. Some authors extract title and abstract words and phrases (Swanson 
and Smalheiser (1999), Gordon and Lindsay (1996), Gordon and Dumais (1998), 
Lindsay and Gordon (1999), Weeber et al. (2001)) or extract the medical subject 
headings (MeSH) assigned to documents indexed in PubMed (Srinivasan (2004)) 
and try to find relevant terms at the top of ranked lists. We use the co-word
analysis technique described by Callon et al. (1991) for clustering of extracted
MeSH terms and for two-dimensional visualisation of the clusters according to
their internal density and external centrality in so-called "strategical
diagrams". We found that in some cases key concepts relevant to the discovery
process can be identified on the basis of positional and/or numerical
characteristics of the respective clusters (Stegmann and Grohmann (2003),
Stegmann and Grohmann (2004)).

In this communication we discuss and evaluate these features using the
following examples of transitive text mining: Raynaud's Disease - Fish Oil
(Swanson (1986)), Migraine - Magnesium (Swanson (1988)), Schizophrenia -
Phospholipase A2 (Smalheiser and Swanson (1998)).

In addition, we briefly describe our web-based information system which will
enable our users to analyse the knowledge domain represented by a literature 
set and to perform transitive text mining on their own. 



2 Methods 

PubMed searches were performed as indicated in the figure legends. The
retrieved document sets were downloaded in MEDLINE format. Extraction of
MeSH terms, subsequent co-occurrence analysis, term clustering and calculation
of cluster density and centrality were performed as described (Stegmann
and Grohmann (2003), Stegmann and Grohmann (2004)). A brief description
of the cluster process follows: the co-occurrence strength of MeSH term pairs
was calculated as the Equivalence Index (Callon et al. (1991))

$$E_{ij} = \frac{C_{ij}^{2}}{C_i \, C_j}$$

where $C_{ij}$ is the number of co-occurrences of terms i and j (i.e. the
number of documents in which terms i and j co-occur), and $C_i$ and $C_j$ are
the numbers of occurrences of terms i and j, respectively. Multiple occurrences
of a MeSH term within the MeSH fields of a document (e.g. with different
subheadings) are ignored. A threshold of $E_{ij}$ > 0.05 was applied. The
cluster process starts with the term pair exhibiting the highest equivalence
index. Of the remaining terms that have links with cluster members, the term
with the highest link strength is added to the cluster. Cluster size is limited
to 3 to 10 terms. The clusters of a literature set are graphically displayed
according to their mean internal link strength (density) and the sum of their
external link strengths (centrality).
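
The extraction and equivalence-index steps described above can be illustrated
with a minimal Python sketch. It is not the authors' Perl/Java implementation.
The function names are hypothetical, the parser assumes the usual MEDLINE
display layout (a "PMID-" line opening each record, "MH  - " lines carrying
MeSH headings with optional "/subheading" parts and "*" major-topic marks), and
it assumes that the 0.05 threshold is applied to the equivalence index.

```python
from collections import Counter
from itertools import combinations


def parse_mesh_terms(medline_text):
    """Return one set of MeSH main headings per MEDLINE-format record."""
    docs, current = [], None
    for line in medline_text.splitlines():
        if line.startswith("PMID-"):
            # Assumption: every record starts with its PMID line.
            current = set()
            docs.append(current)
        elif line.startswith("MH  -") and current is not None:
            heading = line[6:].split("/")[0]  # drop subheadings
            # A set ignores repeated headings within one document.
            current.add(heading.replace("*", "").strip())
    return docs


def equivalence_indices(docs, threshold=0.05):
    """Map each term pair to E_ij = C_ij^2 / (C_i * C_j), keeping E_ij > threshold."""
    occurrences = Counter(term for doc in docs for term in doc)
    cooccurrences = Counter(
        pair for doc in docs for pair in combinations(sorted(doc), 2)
    )
    links = {}
    for (i, j), c_ij in cooccurrences.items():
        e_ij = c_ij ** 2 / (occurrences[i] * occurrences[j])
        if e_ij > threshold:
            links[(i, j)] = e_ij
    return links
```

Calling `equivalence_indices(parse_mesh_terms(text))` on a downloaded record
set yields a dictionary keyed by alphabetically ordered term pairs, which is
the only input the subsequent clustering step needs.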

The tools for MeSH term extraction, calculation of equivalence indices,
cluster generation and calculation of cluster density and centrality have been
programmed in Perl and Java. The Java programs are part of our web-based
information system Charite-Mlink. Cluster diagrams were created using the
software package R (R Development Core Team (2004)).
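
The cluster process and the two diagram coordinates described in this section
can be sketched as follows. This is a simplified illustration in Python, not
the authors' code, and it rests on several assumptions that the text does not
settle: each term joins at most one cluster, ties are broken alphabetically,
clusters that stay below the minimum size are discarded, and density is
averaged over the internal links that actually exist. The function names are
hypothetical.

```python
def build_clusters(links, min_size=3, max_size=10):
    """Greedy clustering over a dict {(term_i, term_j): equivalence index}."""
    strength = {frozenset(pair): e for pair, e in links.items()}
    unassigned = {term for pair in strength for term in pair}
    clusters = []
    while True:
        # Seed: strongest link whose two terms are both still unassigned.
        seeds = [(e, sorted(pair)) for pair, e in strength.items()
                 if pair <= unassigned]
        if not seeds:
            break
        _, seed = max(seeds)
        cluster = list(seed)
        unassigned -= set(seed)
        while len(cluster) < max_size:
            # Candidates: unassigned terms linked to a current cluster member.
            candidates = [
                (e, term)
                for pair, e in strength.items()
                for term in pair & unassigned
                if pair - {term} <= set(cluster)
            ]
            if not candidates:
                break
            _, best = max(candidates)
            cluster.append(best)
            unassigned.discard(best)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


def density_and_centrality(cluster, links):
    """Density: mean internal link strength. Centrality: sum of external links."""
    members = set(cluster)
    internal = [e for pair, e in links.items() if set(pair) <= members]
    external = [e for pair, e in links.items()
                if len(set(pair) & members) == 1]
    density = sum(internal) / len(internal) if internal else 0.0
    return density, sum(external)
```

Plotting each cluster at its (centrality, density) position, together with the
medians of both coordinates, gives the layout of the strategical diagrams
discussed below.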

3 Results and Discussion 

The strategical diagrams have a two-fold function: (i) they should allow the 
identification of clusters containing terms of potential interest for the transitive 
discovery process; (ii) they represent knowledge domains (as far as they are 
comprised by the database retrieval) which can be analysed in terms of centrality 
indicating the importance of clusters and cluster terms for the whole domain, 
and in terms of density indicating the strength of the local coherence of (sub-) 
themes expressed by the cluster terms (Callon et al. (1991)). 

Figures 1, 2 and 3 show strategical diagrams of source literatures which allow
the identification of intermediate terms which in turn are prerequisites for the
detection of target concepts. Diagrams of intermediate literatures are displayed
in Figures 4, 5 and 6. They harbor both target and source terms.

3.1 Source literature 

In the diagrams of source literatures the clusters with the terms defining
the literature are located in regions of high centrality and density, as one can
expect (Figures 1, 2 and 3). The clusters containing some of the intermediate
terms already known from Swanson's work are indicated (Figures 1 and 2). The
terms Spreading Cortical Depression and Epilepsy, being intermediate for the
Migraine - Magnesium literature track, occur in clusters located in the
below-median centrality and density region of the diagram, that is, in the
periphery of the knowledge domain "Migraine" (Figure 2). In contrast, the
cluster containing the term Blood Viscosity as an intermediate for the
Raynaud's Disease - Fish Oil literatures has a higher centrality and about
median density (Figure 1). For each cluster with a centrality > 0 a
centrality/density ratio (cdr) can be calculated as the quotient of its
centrality and density. Dividing the cdr of the source cluster by the cdr of an
intermediate cluster gives a Source-Intermediate Ratio (SIR). SIR is around 1
for the intermediate terms Blood Viscosity (Figure 1) and Epilepsy (Figure 2).
However, the SIR of the intermediate Spreading Cortical Depression (Figure 2)
is quite different from 1. The analysis of other source literatures (not shown)
also identifies some intermediate terms in clusters located in regions of low
density and centrality and/or showing a SIR of around 1, so that these
characteristics may be taken as indicators of where to start the screening of
the clusters for terms of possible relevance to the discovery process. However,
the other clusters should be screened, too. Due to the cluster method used, the
members of a cluster show some similarity to each other and often define
different aspects of a more general theme, which may be helpful in generating
ideas for intermediate concepts. The diagram of the Schizophrenia source
literature (Figure 3) may serve as an example: here, the intermediate cluster
is neither located below the medians nor shows a SIR of around 1. In fact, we
identified it tentatively as an intermediate because it contains the term
Platelet Aggregation, which is also an intermediate term in the Raynaud's
Disease and Migraine literature (not shown), and because that term represents
a physiological property. It is generally a good idea to look for candidate
intermediate terms dealing with physiological conditions (Weeber et al. (2001)).
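
The two indicators used in this section, a position below both medians and a
SIR of around 1, can be written down in a few lines of Python. This is only an
illustrative sketch: the function names are hypothetical, clusters are
represented as (centrality, density) tuples, and the numbers in the usage note
are invented for the example rather than taken from the figures.

```python
from statistics import median


def cdr(centrality, density):
    """Centrality/density ratio of one cluster."""
    return centrality / density


def source_intermediate_ratio(source, candidate):
    """SIR: cdr of the source cluster divided by cdr of a candidate cluster.

    The candidate needs a centrality > 0, otherwise its cdr is 0 and the
    quotient is undefined.
    """
    return cdr(*source) / cdr(*candidate)


def in_periphery(cluster, all_clusters):
    """True if the cluster lies below both medians of the diagram."""
    median_centrality = median(c for c, _ in all_clusters)
    median_density = median(d for _, d in all_clusters)
    return cluster[0] < median_centrality and cluster[1] < median_density
```

For example, with a hypothetical source cluster at (8.0, 2.0) and a candidate
at (2.0, 0.5), both have a cdr of 4, so the SIR is 1 even though the candidate
sits far from the source in the diagram.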

3.2 Intermediate literature 

Figures 4, 5 and 6 show the strategical diagrams of the intermediate literatures
represented by the intermediate terms identified in Figures 1, 2 and 3. As is to
be expected, the clusters containing the main concept of the literature sets
show high centrality. The intermediate literatures contain by definition the
respective source terms because each was chosen after its main term had been
identified in the diagram of a source literature. Now, in the diagrams of the
intermediate literatures the clusters containing a source term may be regarded
as a guide to possible target concepts. In the Blood Viscosity literature
diagram (Figure 4) the source cluster and the cluster containing the target
terms Eicosapentaenoic Acid and Fish Oils are not only located close together
but also show similar centrality/density ratios, which give a quotient
(Source-Target Ratio, STR) of about 1. In the diagram of the intermediate
Spreading Cortical Depression literature set (Figure 5) we also find source and
target clusters in close vicinity; the STR, however, is well above 1. In the
diagram of the intermediate Platelet Aggregation literature (Figure 6) source
and target clusters are not so close together, but the STR value is around 1.
All target terms are already known from Swanson's work.
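
The two cues used above, an STR close to 1 and closeness to the source cluster
in the diagram, could be combined into a simple screening order. The following
Python sketch is one possible reading, not a procedure given in the text: it
assumes that STR is the cdr of the source cluster divided by the cdr of the
candidate cluster (by analogy with SIR), it ranks by the absolute logarithm of
STR so that 2 and 0.5 count as equally far from 1, and it uses the plain
Euclidean distance on unscaled axes as a crude measure of vicinity. The
function name and the example values in the test data are hypothetical.

```python
import math


def screening_order(source, clusters):
    """Rank clusters of an intermediate literature for target screening.

    source   -- (centrality, density) of the cluster holding the source term
    clusters -- dict {name: (centrality, density)} of the other clusters
    Returns (name, STR, distance) tuples, best candidates first.
    """
    src_centrality, src_density = source
    src_cdr = src_centrality / src_density
    ranked = []
    for name, (centrality, density) in clusters.items():
        if centrality <= 0:
            continue  # cdr of 0 makes the ratio undefined
        str_value = src_cdr / (centrality / density)
        distance = math.hypot(centrality - src_centrality,
                              density - src_density)
        # Primary key: how far STR is from 1; secondary key: distance.
        ranked.append((abs(math.log(str_value)), distance, name, str_value))
    ranked.sort()
    return [(name, str_value, distance)
            for _, distance, name, str_value in ranked]
```

Such a ranking only suggests where to start; as noted below, some target terms
lie in clusters that neither cue would place near the top.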

As with the source literature, we can start the screening of an intermediate
literature with clusters exhibiting similar centrality/density ratios and/or
lying in the neighbourhood of source clusters. However, some target terms are
found in clusters outside of this frame (not shown); the screening of other
clusters is always advisable. In addition, when dealing with large literature
sets consisting of many thousands of documents, a very large number of clusters
has to be screened. We have not yet experimented with variable cluster sizes,
but a higher number of terms per cluster might affect cluster readability.
Thus, other text mining methods should be employed. For example, Gordon and
Dumais (1998) applied Latent Semantic Analysis (LSA) to the analysis of the
Raynaud's Disease literature and found relevant intermediate terms at high
ranks on lists generated from title and abstract words. They failed, however,
to find target terms at equally high positions on lists derived from the
intermediate literature after LSA treatment. We are currently investigating the
usefulness of LSA for the analysis of document-by-MeSH-term matrices (in
preparation).

3.3 Charite-Mlink 

Our web-based information system Charite-Mlink enables the user to upload 
PubMed literature sets and to navigate in the information space constituted by 
the MeSH term clusters generated by the system. In addition, the system makes 
suggestions with respect to clusters containing terms of potential relevance to a 
discovery process based on transitive text mining as described in the previous 
sections. The first version of Charite-Mlink was released in August 2005
and can be accessed at http://mlink.charite.de/

4 Conclusion



We have described an approach to transitive text mining which is based
on the co-occurrence analysis and subsequent clustering of the MeSH terms
assigned to PubMed documents. Our results suggest which clusters should be
screened first in a discovery process. Future work employing other text mining
methods is necessary in order to generate hypotheses which cannot be derived
from one knowledge domain only.

Figure 1: Strategical diagram of the Raynaud's Disease† literature set.
†PubMed title search for "raynaud*", publication years 1966-1985.
No. of documents: 802, no. of distinct MeSH terms: 454, no. of clusters: 44.
Triangles indicate cluster positions; dotted lines indicate medians.
RD: cluster containing the source term Raynaud's Disease.
BV: cluster containing the intermediate term Blood Viscosity.
SIR: source-intermediate ratio.

Figure 2: Strategical diagram of the Migraine† literature set.
†PubMed title search for "migraine", publication years 1966-1987.
No. of documents: 2583, no. of distinct MeSH terms: 1021, no. of clusters: 106.
Mi: cluster containing the source term Migraine.
Epi: cluster containing the intermediate term Epilepsy.
SCD: cluster containing the intermediate term Spreading Cortical Depression.
Other details: see Figure 1.

Figure 3: Strategical diagram of the Schizophrenia† literature set.
†PubMed title search for "schizophrenia", publication years 1966-1985.
No. of documents: 6225, no. of distinct MeSH terms: 1598, no. of clusters: 164.
Sch: cluster containing the source term Schizophrenia.
PA: cluster containing the intermediate term Platelet Aggregation.
Other details: see Figure 1.

Figure 4: Strategical diagram of the Blood Viscosity† literature set.
†PubMed title search for "blood viscosity", publication years 1966-1987.
No. of documents: 326, no. of distinct MeSH terms: 293, no. of clusters: 39.
BV: cluster containing the intermediate term Blood Viscosity.
RD: cluster containing the source term Raynaud's Disease.
EPA: cluster containing the target terms Eicosapentaenoic Acid and Fish Oils.
STR: source-target ratio.
Other details: see Figure 1.

Figure 5: Strategical diagram of the Spreading Cortical Depression† literature
set.
†PubMed title search for "spreading cortical depression", publication years
1966-1985.
No. of documents: 502, no. of distinct MeSH terms: 391, no. of clusters: 30.
SCD: cluster containing the intermediate term Spreading Cortical Depression.
Mi: cluster containing the source term Migraine.
Mg: cluster containing the target term Magnesium.
Other details: see Figure 1 and Figure 4.

Figure 6: Strategical diagram of the Platelet Aggregation† literature set.
†PubMed title search for "platelet aggregation", publication years 1966-1985.
No. of documents: 2638, no. of distinct MeSH terms: 1449, no. of clusters: 148.
PA: cluster containing the intermediate term Platelet Aggregation.
Sch: cluster containing the source term Schizophrenia.
PLA: cluster containing the target terms Phospholipase A.
Other details: see Figure 1 and Figure 4.






